[python] Fix read_paimon ArrowInvalid on PK tables with single-snapshot data#7820

Merged
JingsongLi merged 1 commit into apache:master from TheR1sing3un:py-fix-ray-read-nullable-schema
May 12, 2026
Conversation

@TheR1sing3un
Member

Purpose

read_paimon() crashes with pyarrow.lib.ArrowInvalid when reading a
primary-key table whose data consists of a single snapshot (i.e. all
splits are raw-convertible). The issue is in
RayDatasource._get_read_task:

yield pyarrow.Table.from_batches([batch], schema=schema)

schema comes from PyarrowFieldParser.from_paimon_schema and marks
PK columns as NOT NULL. The batch from the Parquet reader may have
those columns as nullable — from_batches does a strict schema equality
check (including the nullable bit) and rejects the mismatch.

This is a pre-existing issue on master. It was never triggered by
existing tests because they all write multiple snapshots (creating
non-raw-convertible splits that go through the merge-read path, which
preserves nullability).

Linked Issue

Discovered while testing PR #7813 on CI (Python 3.10 / pyarrow in the
CI container triggers the strict check; newer pyarrow on local dev
machines is more lenient).

Fix

Replace the strict from_batches([batch], schema=schema) with:

table = pyarrow.Table.from_batches([batch])
if table.schema != schema:
    table = table.cast(schema)
yield table

Table.cast(target_schema) is a zero-copy metadata-only operation for
nullable→not-null diffs. It also handles other type promotions (e.g.
large_string → string) that may occur on some Ray versions.

When schemas already match, the if branch is skipped — zero overhead.

Tests

Added test_read_paimon_pk_single_snapshot: creates a PK table, performs a
single write, and calls read_paimon() — verifying that raw-convertible
splits no longer raise ArrowInvalid.

All existing ray_integration_test.py tests remain green.

API & Format Impact

None. Pure internal fix in the Ray read task function.

Documentation Impact

None.

Generative AI Disclosure

Drafted with Claude Code assistance, reviewed and tested by the author.

…ible splits

Table.from_batches rejects batches whose schema differs from the
declared schema in the nullable bit — PK columns are marked NOT NULL
in the Paimon schema but the Parquet reader may produce nullable
fields on certain pyarrow versions. Use Table.cast to align the batch
schema before yielding to Ray, which is a zero-copy metadata-only
operation for nullable diffs.

This fixes read_paimon crashing with ArrowInvalid on PK tables where
all splits are raw-convertible (e.g. single-snapshot data with no
overlapping keys).

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 86e6eed into apache:master May 12, 2026
6 checks passed